base arm
Combinatorial Multi-Armed Bandit with General Reward Functions
Wei Chen, Wei Hu, Fu Li, Jian Li, Yu Liu, Pinyan Lu
In this paper, unless otherwise specified, we use MAB to refer to stochastic MAB. The MAB problem demonstrates the key tradeoff between exploration and exploitation: whether the player should stick to the choice that has performed best so far, or try some less-explored alternatives that may provide better rewards.
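The excerpt above states the explore/exploit tradeoff in words. A minimal, self-contained sketch of one standard way to balance it — UCB1-style indices on Bernoulli arms — is given below; this is illustrative only and is not the algorithm proposed in the paper:

```python
import math
import random

# Illustrative UCB1 loop for a stochastic MAB (not from the paper above):
# the index adds an exploration bonus to the empirical mean, so the arm that
# looks best so far is exploited while rarely-pulled arms still get explored.
def ucb1(arm_means, horizon, rng=random.Random(0)):
    k = len(arm_means)
    counts = [0] * k        # pulls per arm
    sums = [0.0] * k        # cumulative reward per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:          # pull each arm once to initialize its estimate
            arm = t - 1
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0  # Bernoulli arm
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    return total_reward

print(ucb1([0.2, 0.5, 0.8], horizon=10_000))
```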
Oracle-Efficient Combinatorial Semi-Bandits
Jung-hun Kim, Milan Vojnović, Min-hwan Oh
We study the combinatorial semi-bandit problem where an agent selects a subset of base arms and receives individual feedback. While this generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, requiring oracle queries at every round. To tackle this, we propose oracle-efficient frameworks that significantly reduce oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log\log T)$ oracle queries. We also propose covariance-adaptive algorithms that leverage noise structure for improved regret, and extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in time, with strong theoretical guarantees.
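The headline of this abstract is reducing oracle calls from linear in $T$ to doubly logarithmic. The sketch below only illustrates how such a budget can arise under a squaring batch schedule, with a hypothetical `oracle` (combinatorial solver) and `estimate_means` supplied by the caller; it is not the paper's actual algorithm or update rule:

```python
# Hedged sketch (not the paper's algorithm): one way to get only
# O(log log T) oracle calls is to re-solve the combinatorial problem
# only at rounds t_1, t_2, ... with t_{j+1} = t_j^2, reusing the last
# oracle solution in between.
def oracle_call_schedule(horizon):
    t, rounds = 2, []
    while t <= horizon:
        rounds.append(t)
        t = t * t                 # squaring schedule -> doubly-logarithmic calls
    return rounds

def run(horizon, oracle, estimate_means):
    # `oracle` and `estimate_means` are hypothetical placeholders here.
    schedule = set(oracle_call_schedule(horizon))
    action, calls = None, 0
    for t in range(1, horizon + 1):
        if action is None or t in schedule:
            action = oracle(estimate_means(t))   # expensive combinatorial step
            calls += 1
        # ... play `action`, observe semi-bandit feedback, update estimates ...
    return calls

# A horizon of 10**6 needs only a handful of oracle calls under this schedule.
print(oracle_call_schedule(10**6))   # [2, 4, 16, 256, 65536]
```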
Multi-Play Combinatorial Semi-Bandit Problem
Shintaro Nakamura, Yuko Kuroki, Wei Chen
In the combinatorial semi-bandit (CSB) problem, a player selects an action from a combinatorial action set and observes feedback from the base arms included in the action. While CSB is widely applicable to combinatorial optimization problems, its restriction to binary decision spaces excludes important cases involving non-negative integer flows or allocations, such as the optimal transport and knapsack problems. To overcome this limitation, we propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can select a non-negative integer action and observe multiple feedback observations from a single arm in each round. We propose two algorithms for the MP-CSB. One is a Thompson-sampling-based algorithm that is computationally feasible even when the action space is exponentially large with respect to the number of arms, and attains $O(\log T)$ distribution-dependent regret in the stochastic regime, where $T$ is the time horizon. The other is a best-of-both-worlds algorithm, which achieves $O(\log T)$ variance-dependent regret in the stochastic regime and worst-case $\tilde{\mathcal{O}}\left( \sqrt{T} \right)$ regret in the adversarial regime. Moreover, its regret in the adversarial regime is data-dependent, adapting to the cumulative loss of the optimal action, the total quadratic variation, and the path-length of the loss sequence. Finally, we numerically show that the proposed algorithms outperform existing methods in the CSB literature.
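The abstract describes a Thompson-sampling-based algorithm over non-negative integer actions. A hedged sketch of that general idea follows, assuming Gaussian posteriors per arm and a toy greedy integer-allocation oracle with a per-arm capacity and a total budget; the names and the allocation rule are illustrative choices, not the paper's method:

```python
import numpy as np

# Hedged sketch, not the paper's algorithm: Thompson sampling for an
# integer-allocation semi-bandit. Each arm keeps a Gaussian posterior over
# its mean reward; each round we sample means, greedily allocate integer
# plays to the best sampled arms, and update posteriors with per-arm feedback.
rng = np.random.default_rng(0)

def thompson_mp_csb(true_means, capacity, budget, horizon):
    k = len(true_means)
    post_mean, post_count = np.zeros(k), np.ones(k)
    for _ in range(horizon):
        theta = rng.normal(post_mean, 1.0 / np.sqrt(post_count))  # posterior sample
        # toy oracle: greedily assign integer plays to arms with the largest samples
        action = np.zeros(k, dtype=int)
        remaining = budget
        for i in np.argsort(-theta):
            take = min(capacity, remaining)
            action[i], remaining = take, remaining - take
            if remaining == 0:
                break
        # semi-bandit feedback: one noisy observation per play of each chosen arm
        for i in np.nonzero(action)[0]:
            obs = rng.normal(true_means[i], 1.0, size=action[i])
            post_mean[i] = (post_mean[i] * post_count[i] + obs.sum()) / (post_count[i] + action[i])
            post_count[i] += action[i]
    return post_mean

print(thompson_mp_csb(np.array([0.1, 0.5, 0.9, 0.3]), capacity=2, budget=4, horizon=2000))
```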